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Abstract 

We introduce a new principle for model selection in regression and classifica- 
tion. Many regression models are controlled by some smoothness or flexibility 
or complexity parameter c, e.g. the number of neighbors to be averaged over 
in k nearest neighbor (kNN) regression or the polynomial degree in regression 
with polynomials. Let /£, be the (best) regressor of complexity c on data D. 
A more flexible regressor can fit more data D' well than a more rigid one. If 
something (here small loss) is easy to achieve it's typically worth less. We 
define the loss rank of as the number of other (fictitious) data D' that 
are fitted better by than D is fitted by We suggest selecting the 
model complexity c that has minimal loss rank (LoRP). Unlike most penal- 
ized maximum likelihood variants (AIC,BIC,MDL), LoRP only depends on 
the regression functions and the loss function. It works without a stochastic 
noise model, and is directly applicable to any non-parametric regressor, like 
kNN. In this paper we formalize, discuss, and motivate LoRP, study it for 
specific regression problems, in particular linear ones, and compare it to other 
model selection schemes. 
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1 Introduction 



Regression. Consider a regression or classification problem in which we want to de- 
termine the functional relationship yi~ ftrue(xi) from data D = {(xi,yi),...,(x n ,y n )}£ 
V, i.e. we seek a function fa such that is close to the unknown ft r ue( x ) for all 

x. One may define regressor fa directly, e.g. 'average the y values of the k nearest 
neighbors (kNN) of z in D', or select the / from a class of functions T that has 
smallest (training) error on D. If the class T is not too large, e.g. the polynomials 
of fixed reasonable degree d, this often works well. 

Model selection. What remains is to select the right model complexity c, like k 
or d. This selection cannot be based on the training error, since the more complex 
the model (large d, small k) the better the fit on D (perfect for d = n and k — 1). 
This problem is called overfitting, for which various remedies have been suggested: 
We will not discuss empirical test set methods like cross-validation, but only 
training set based methods. See e.g. |Mac92] for a comparison of cross-validation 
with Bayesian model selection. Training set based model selection methods allow 
using all data D for regression. The most popular ones can be regarded as penalized 
versions of Maximum Likelihood (ML). In addition to the function class J 7 , one 
has to specify a sampling model P(D\f), e.g. that the yi have independent Gaus- 
sian distribution with mean f(xi). ML chooses f ^ = argmaxf^P (D\f), Penalized 
ML (PML) then chooses c = argmin c {— logP Penalty (c)}, where the penalty 
depends on the used approach (MDL |Ris78j . BIC |Sch78j . AIC [Aka73j ). In par- 
ticular, modern MDL [Grii04j has sound exact foundations and works very well in 
practice. All PML variants rely on a proper sampling model (which may be difficult 
to establish), ignore (or at least do not tell how to incorporate) a potentially given 
loss function, and are typically limited to (semi)parametric models. 

Main idea. The main goal of the paper is to establish a criterion for selecting 
the "best" model complexity c based on regressors /£, given as a black box without 
insight into the origin or inner structure of that does not depend on things 
often not given (like a stochastic noise model), and that exploits what is given (like 
the loss function). The key observation we exploit is that large classes T c or more 
flexible regressors ffj can fit more data D'gD well than more rigid ones, e.g. many 
D' can be fit well with high order polynomials. We define the loss rank of /£, as 
the number of other (fictitious) data D' aT> that are fitted better by ffj, than D is 
fitted by /£, as measured by some loss function. The loss rank is large for regressors 
fitting D not well and for too flexible regressors (in both cases the regressor fits 
many other D' better). The loss rank has a minimum for not too flexible regressors 
which fit D not too bad. We claim that minimizing the loss rank is a suitable 
model selection criterion, since it trades off the quality of fit with the flexibility of 
the model. Unlike PML, our new Loss Rank Principle (LoRP) works without a 
noise (stochastic sampling) model, and is directly applicable to any non-parametric 
regressor, like kNN. 
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Contents. In Section [2j after giving a brief introduction to regression, we formally 
state LoRP for model selection. To make it applicable to real problems, we have 
to generalize it to continuous spaces and regularize infinite loss ranks. In Section 
[3] we derive explicit expressions for the loss rank for the important class of linear 
regressors, which includes kNN, polynomial, linear basis function (LBFR), Kernel, 
and projective regression. In Section S] we compare linear LoRP to Bayesian model 
selection for linear regression with Gaussian noise and prior, and in Section [5] to 
PML, in particular MDL, BIC, AIC, and MacKay's [Mac92j and Hastie's et al. 
[HTFOlJ trace formulas for the effective dimension. In this paper we just scratch at 
the surface of LoRP. Section [6] contains further considerations, to be elaborated on 
in the future. 



2 The Loss Rank Principle (LoRP) 

After giving a brief introduction to regression, classification, model selection, over- 
fitting, and some reoccurring examples (polynomial regression Example [1] and kNN 
Example [2]), we state our novel Loss Rank Principle for model selection. We first 
state it for classification (Principle [3J for discrete values), and then generalize it 
for regression (Principle [5] for continuous values), and exemplify it on two (over- 
simplistic) artificial Examples H] and [6j Thereafter we show how to regularize LoRP 
for realistic regression problems. 

Setup. We assume data D = (x,y) := {(xi,yi),...,(x n ,y n )} G (X x y) n =:V has been 
observed. We think of the y as having an approximate functional dependence on 
x, i.e. yi~ ftrue(xi), where ~ means that the yi are distorted by noise or otherwise 
from the unknown "true" values ftrue{xi)- 

Regression and classification. In regression problems y is typically (a subset 
of) the real numbers M or some more general measurable space like M m . In clas- 
sification, y is a finite set or at least discrete. We impose no restrictions on X. 
Indeed, x will essentially be fixed and plays only a spectator role, so we will of- 
ten notationally suppress dependencies on x. The goal of regression is to find a 
function fo G T C X — > y "close" to j irue based on the past observations D. Or 
phrased in another way: we are interested in a regression function r : Z> — » JF such 
that y:—r(x\D)=r(D)(x) = fD(x)~ft r ue( x ) f° r allxeAf. 

Notation. We will write (x,y) or (xo,yo) for generic data points, use vector notation 
x = (xi,...,x n ) T and y = (yi,...,y n ) T , and D' = (x',y') for generic (fictitious) data of 
size n. 

Example 1 (polynomial regression) For X = y = M, consider the set : = 
{fw(x) =WdX d ~ 1 + ...w 2 x+Wi : w G M d } of polynomials of degree d — 1. Fitting the 
polynomial to data D, e.g. by least squares regression, we estimate w with wjj. The 
regression function y = r<i(x\D) = fa, D {x) can be written down in closed form (see 
Example M). 
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Example 2 (k nearest neighbors, kNN) Let y be some vector space like M 
and X be a metric space like M m with some (e.g. Euclidian) metric d(-,-). kNN 
estimates /t rue (x) by averaging the y values of the k nearest neighbors A4(x) of x 
in D, i.e. r&(x|.D) = |X)ieA4(a;)^ with \J\fk(x)\ = k such that <d(x,Xj) for all 

iG7V fc (x) and j<£Af h (x). * 

Parametric versus non-parametric regression. Polynomial regression is an 
example of parametric regression in the sense that r^D) is the optimal function 
from a family of functions indexed by d<oo real parameters (w). In contrast, 
the kNN regressor r k is directly given and is not based on a finite-dimensional 
family of functions. In general, r may be given either directly or be the result of an 
optimization process. 

Loss function. The quality of fit to the data is usually measured by a loss 
function Loss(y,y), where yi = foi^i) is an estimate of ^. Often the loss is ad- 
ditive: Loss(y,y) = ^™ =1 Loss(?/i,?/j). If the class T is not too large, good re- 
gressors r can be found by minimizing the loss w.r.t. all / G T. For instance, 
r d (D) =argmm f( zjr d ^ =1 (y i - f(xi)) 2 and y = r d (x\D) in Example [Q 

Regression class and loss. In the following we assume a (typically countable) 
class of regressors 7Z (whatever their origin), e.g. the kNN regressors {r k : k G IN} 
or the least squares polynomial regressors {r^ : d G W }. Note that unlike /GJ 7 , 
regressors r eTZ are not functions of x alone but depend on all observations D, in 
particular on y. Like for functions /, we can compute the loss of each regressor 
reTZ: 

n 

Loss r (-D) = Loss r (y|a;) := Loss(y, y) = ^] Loss ( yj , r (xj \ x , y) ) 

i=i 

where yi = r(xi\D) in the third expression, and the last expression holds in case of 
additive loss. 

Overfitting. Unfortunately, minimizing Loss r w.r.t. r will typically not select the 
"best" overall regressor. This is the well-known overfitting problem. In case of 
polynomials, the classes Td C Fd+i are nested, hence Loss rd is monotone decreasing 
in d with Loss rn = perfectly fitting the data. In case of kNN, Loss rfc is more 
or less an increasing function in k with perfect regression on D for k — 1, since no 
averaging takes place. In general, TZ is often indexed by a "flexibility" or smoothness 
or complexity parameter, which has to be properly determined. More flexible r can 
closer fit the data and hence have smaller loss, but are not necessarily better, since 
they have higher variance. Clearly, too inflexible r also lead to a bad fit ("high 
bias"). 

Main goal. The main goal of the paper is to establish a selection criterion for the 
"best" regressor reTZ 
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• based on r given as a black box that does not require insight into the origin 
or inner structure of r, 

• that does not depend on things often not given (like a stochastic noise model), 

• that exploits what is given (like the loss function). 

While for parametric (e.g. polynomial) regression, MDL and Bayesian methods work 
well (effectively the number of parameters serve as complexity penalty), their use is 
seriously limited for non-parametric black box r like kNN or if a stochastic/coding 
model is hard to establish (see Section H] for a detailed comparison). 

Main idea: loss rank. The key observation we exploit is that a more flexible r 
can fit more data D'eD well than a more rigid one. For instance, can perfectly 
fit all D' for d—n, all D 1 that lie on a parabola for d—3, but only linear D' for d—2. 
We consider discrete y i.e. classification first, and fix x. y is the observed data and 
y' are fictitious others. 

Instead of minimizing the unsuitable Loss r (y\x) w.r.t. r, we could ask how many 
y' &y n lead to smaller Loss r than y. Many y' have small loss for flexible r, and so 
smallness of Loss r is less significant than if y is among very few other y' with small 
Loss r . We claim that the loss rank of y among all y' Ey n is a suitable measure of 
fit. We define the rank of y under r as the number of y' Gj" with smaller or equal 
loss than y : 

Rank r (y|a;) = Rank r (L) := G y n : Loss r (y» < L}, (1) 

where L := Loss r (y\x) 

For this to make sense, we have to assume (and will later assure) that Rank r (L) < oo, 
i.e. there are only finitely many y' G y n having loss smaller than L. In a sense, 
p = Rank r (y \x) measures how compatible y is with r; y is the pth most compatible 
with r. 

Since the logarithm is a strictly monotone increasing function, we can also con- 
sider the logarithmic rank LK r (y\x) := logRank r (7/|a;), which will be more conve- 
nient. 

Principle 3 (loss rank principle (LoRP) for classification) For discrete y, 
the best classifier /regress or r :Vx X -^y in some class 1Z for data D = (x,y) is 
the one of smallest loss rank: 

r best = argminLR. r (^/|a;) = argminRank r (^|cc) (2) 

reTZ reTZ 

where Rank r is defined in (Qp. 

We give a simple example for which we can compute all ranks by hand to help 
better grasping how the principle works, but the example is too simplistic to allow 
any conclusion on whether the principle is appropriate. 
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Example 4 (simple discrete) Consider X = {1,2}, y = {0,1,2}, and two points 
D={(1,1),(2,2)} lying on the diagonal x=y, with polynomial (zero, constant, linear) 
least squares regression 71= {r ,ri,r 2 } (see Ex{TJ). r is simply 0, r\ the y-average, 
and r 2 the line through points (1,2/1) and (2,y 2 ). This, together with the quadratic 
Loss for generic y' and observed y = (1,2) (and fixed x = (1,2)), is summarized in 
the following table 



d 


r d (x\x,y') 


LoSS d (t/| X) 


Loss d (£>) 








y? + y? 


5 


1 


|(yi + y 2 ) 


m-yi) 2 


i 

2 


2 


(y 2 -yi)(x-l)+yi 









From the Loss we can easily compute the Rank for all nine t/'e{0,l,2} 2 . Equal rank 
due to equal loss is indicated by a = in the table below. Whole equality groups are 
actually assigned the rank of their right-most member, e.g. for d=l the ranks of 
(^,y 2 ) = (0,l),(l,0),(2,l),(l,2) are all 7 (and not 4,5,6,7). 









Rank rd (^ 2 |12) 




d 




1 2 


3 4 5 6 7 8 9 


Rank rd (D) 





y'iy'2 


= 00 < 01 


= 10 < 11 < 02 = 20 < 21 = 12 < 22 


8 


1 


M 


= 00 = 11 


= 22 < 01 = 10 = 21 = 12 < 02 = 20 


7 


2 




= 00 = 01 


= 02 = 10 = 11 = 20 = 21 = 22 = 12 
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So LoRP selects T\ as best regressor, since it has minimal rank on D. r fits D too 
badly and r 2 is too flexible (perfectly fits all D'). <0 

LoRP for continuous y. We now consider the case of continuous or measurable 
spaces y, i.e. normal regression problems. We assume y = M in the following expo- 
sition, but the idea and resulting principle hold for more general measurable spaces 
like M m . We simply reduce the model selection problem to the discrete case by con- 
sidering the discretized space y e = E/Z for small e>0 and discretize y^y £ EeZ n . 
Then Rank^(L) := #{y' e G y™ : Loss r (y' £ \x) < L} with L = Loss r (y £ \x) counts the 
number of e-grid points in the set 

V r (L) := {y' G y n : Loss r (s/» < L} (3) 

which we assume (and later assure) to be finite, analogous to the discrete case. 
Hence Rank^(L)-e n is an approximation of the loss volume \V r (L)\ of set V r (L), 
and typically Rank^(L)-e n = \V r (L)\-(l + 0(e)) -> \V r (L)\ for e -> 0. Taking the 
logarithm we get LR^ (y \ x) = logRank^(L) = log|K.(L)| —n\oge+ 0(e). Since nlog£ is 
independent of r, we can drop it in comparisons like (j2J). So for e— >0 we can define 
the log-loss "rank" simply as the log-volume 

LR r (y\x) := log |K.(L)|, where L := Loss r (y\x) (4) 
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Principle 5 (loss rank principle for regression) For measurable y , the best 
regressor r :Vx X ^y in some class 71 for data D = (x,y) is the one of small- 
est loss volume: 

r best = argminLR r (2/|a;) = argmin |V^.(L)| 
where LR ; V r , and L are defined in |3j] and Q), and \ V r (L)\ is the volume ofV r (L)G 

y n . 

For discrete y with counting measure we recover the discrete Loss Rank Principle 

El 

Example 6 (simple continuous) Consider Example Hlbut with interval 3^= [0,2]. 
The first table remains unchanged, while the second table becomes 



d 


V d (L) = {y'e[0,2f:...} 




V d (L)\ 




V d (Loss d (D))\ 




1 

2 


y[ 2 + y' 2 2 <L 

\{y' 2 -y',Y<L 
< L 


2-^/max{L-4,0} + 
L(f-co S -l(mm{^,l})) 

4 v / 2l- 2L 
4 


« 3.6 

3 
4 



So LoRP again selects r% as best regressor, since it has smallest loss volume on D. 



Infinite rank or volume. Often the loss rank/volume will be infinite, e.g. if we 
had chosen y = 1Z in ExJH or y = M in Exj6l We will encounter such infinities in 
Section El There are various potential remedies. We could modify (a) the regressor 
r or (b) the Loss to make LR r finite, (c) the Loss Rank Principle itself, or (d) find 
problem-specific solutions. Regressors r with infinite rank might be rejected for 
philosophical or pragmatic reasons. We will briefly consider (a) for linear regression 
later, but to fiddle around with r in a generic (blackbox way) seems difficult. We 
have no good idea how to tinker with LoRP (c), and also a patched LoRP may be 
less attractive. For kNN on a grid we later use remedy (d). While in (decision) 
theory, the application's goal determines the loss, in practice the loss is often more 
determined by convenience or rules of thumb. So the Loss (b) seems the most 
inviting place to tinker with. A very simple modification is to add a small penalty 
term to the loss. 

Loss r (y\x) ^ Loss® (y\x) := Loss r (y\x) + a\\y\\ 2 , a > "small" (5) 

The Euclidian norm ||^|| 2 : = ^r=i% 2 ^ s default, but other (non)norm regularizes are 
possible. The regularized LR"(y\x) based on Loss" is always finite, since {y : ||y|| 2 < 
L} has finite volume. An alternative penalty ay y, quadratic in the regression 
estimates yi = r(xi\x,y) is possible if r is unbounded in every y^-oo direction. 

A scheme trying to determine a single (flexibility) parameter (like d and k in 
the above examples) would be of no use if it depended on one (or more) other 
unknown parameters (a), since varying through the unknown parameter leads to 
any (non) desired result. Since LoRP seeks the r of smallest rank, it is natural to 
also determine a by minimizing LR" w.r.t. a. The good news is that this leads to 
meaningful results. 
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3 LoRP for Linear Models 



In this section we consider the important class of linear regressors with quadratic 
loss function. Since linearity is only assumed in y and the dependence on x can be 
arbitrary, this class is richer than it may appear. It includes kNN ( Example EJ), kernel 
(Example [8]), and many other regressors. For linear regression and y = M, the loss 
rank is the volume of an n-dimensional ellipsoid, which can efficiently be computed 
in time 0(n 3 ) (Theorem [TOl) . For the special case of projective regression, e.g. linear 
basis function regression (Example [HD, we can even determine the regularization 
parameter a analytically (Theorem [TTi) . 

Linear regression. We assume y = M in this section; generalization to M m is 
straightforward. A linear regressor r can be written in the form 

y = r(x\x,y) = ^^mj(x, x)yj Vx G X and some rrij : X x X n — > M (6) 

3=1 

Particularly interesting is r for x—Xi,...,x n . 

y t = r( Xi \x,y) = J2 M ij( x )yj with M:X n ^R nxn (7) 

3 

where matrix My(ac) = rrij(xi,x). Since LoRP needs r only on the training data x, 
we only need M. 

Example 7 (kNN ctd.) For kNN of Exfjwe have m j (x ) x) = \ if jej\f k (x) and 
else, and Mij(x) = j- if jEj\fk(xi) and else. ^> 
Example 8 (kernel regression) Kernel regression takes a weighted average over 
y, where the weight of yj to y is proportional to the similarity of Xj to x, measured by 
a kernel K(x,Xj), i.e. rrij(x,x) —K{x,Xj)/Y^j = \K{x,Xj). For example the Gaussian 
kernel for X = IR m is K(x,xj) = e -W x - x M/^\ <> 

Example 9 (linear basis function regression, LBFR) Let (pi(x),...,(j) c i(x) be a 
set or vector of "basis" functions often called "features". We place no restrictions 
on X or (f>:X^]R d . Consider the class of functions linear in </>: 

Fd = {f W ( X ) = Yl=l W a<Pa{.x) = W T (j)(x) : W G R d } 

For instance, for X = M and 4> a (x) =x a ~ 1 we would recover the polynomial regression 
Example [TJ For quadratic loss function Loss(?/j,yj) = (?/j — fji) 2 we have 

n 

Loss w (y\<(>) := ^(yi - f w {xi)) 2 = y T y - 2y T ®w + w T Bw 

where matrix $ is defined by $j a = <fi a {xi) and B is a symmetric matrix with 
B a b = Y^i=i ( l ) a{xi) < fib{xi) = &\ab- The loss is quadratic in w with minimum at 
w = B 1-1 <& T ?/. So the least squares regressor is y = y T &B~ 1 <p(x), hence rrij(x,x) = 
{^B- l (j){x))j and M(x) = <$>B- 1 ® Y . 
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Consider now a general linear regressor M with quadratic loss and quadratic 
penalty 



2 

Loss^(?/|a;) = \y t - E?=i^%%) +«ll?/l| 2 = y T S a y, 

i=l 

wher<£] S a = (1 — M) T (1 — M) + al 



(1 is the identity matrix). is a symmetric matrix. For a>0 it is positive definite 

and for a = positive semidefinite. If Ai,...A n >0 are the eigenvalues of So, then 

Xi+a are the eigenvalues of S a . V(L) = {y' &M n :y' T S a y' <L} is an ellipsoid with the 

eigenvectors of S a being the main axes and y/Lf (Xi+a) being their length. Hence 

the volume is 

-JL I L D T n / 2 

\v(l)\ = v n r\J— — = ^L= 

fj" V Ai + « v/det^ 

where v n = 7i n ^ 2 /^! is the volume of the n-dimensional unit sphere, z\:—T(z+l), and 
det is the determinant. Taking the logarithm we get 

LR a M (y\x) = log|y(Loss^(y|a;))| = f log(y T S a y) - \ logdet S a + \ogv n (9) 

Consider now a class of linear regressors TZ = {M}, e.g. the kNN regressors {M^-.kE 
IN} or the d-dimensional linear basis function regressors {Md'.d^lNo}. 

Theorem 10 (LoRP for linear regression) Fory = M, the best linear regressor 
M:X n ^ M nxn in some class M for data D = (x,y) is 

M best = argmin{f \og(y T S a y) - ±\ogdet S a } = argmin 1 -^-^— 1 (10) 

where S a is defined in 

Since v n is independent of a and M it was possible to drop v n . The last expression 
shows that linear LoRP minimizes the Loss times the geometric average of the 
squared axes lengths of ellipsoid V(l). Note that M best depends on y unlike the 
MeM. 

Nullspace of Sq. If M has an eigenvalue 1, then So = (1 — M) T (1 — M) has a 
zero eigenvalue and a > is necessary, since detSo = 0. Actually this is true for 
most practical M. Nearly all linear regressors are invariant under a constant shift 
of y, i.e. r(yi + c\D) =r(y i \D)+c ) which implies that M has eigenvector (1,...,1) T 
with eigenvalue 1. This can easily be checked for kNN (ExJ2J), Kernel (ExJHJ), and 



1 The mentioned alternative penalty a||y|| 2 would lead to S a — (1 — AI) T (TL — M)+aM T M . For 
LBFR, penalty a||t(;|| 2 is popular (ridge regression). Apart from being limited to parametric 
regression, it has the disadvantage of not being reparametrization invariant. For instance, scaling 
</ , a(^)^7o</>a(a ; ) doesn't change the class but changes the ridge regressor. 
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LBFR (Ex|9]). Such a generic 1-eigenvector effecting all M G M. could easily and 
maybe should be filtered out by considering only the orthogonal space or dropping 
these Aj = when computing detSo- The 1-eigenvectors that depend on M are the 
ones where we really need a regularizer a > for. For instance, in LBFR has d 
eigenvalues 1, and M^nn has as many eigenvalues 1 as there are disjoint components 
in the graph determined by the edges > In general we need to find the optimal 
a numerically. If M is a projection we can find a m « n analytically. 

Projective regression. Consider a projection matrix M = P = P 2 with d = trP 
eigenvalues 1, and n—d zero eigenvalues. For instance, M = QB~ 1 Q Y of LBFR Exj9] 
is such a matrix, since M$ = $ and M\I/ = for ^ such that <& T \l/ = 0. This implies 
that S a has d eigenvalues a and n — d eigenvalues 1 + a. Hence 

detS a = a d (l + a) n ~ d , where S a = S + al = 1 - P + al 

To / , A t i y Ts ov 1 y Tp v 

y S a y = (p + a)y y, where p := — = — = 1 

y l y y l y 

LRp = flog/y + f log(p + a)-f loga- ^log(l + a) (11) 

The first term is independent of a. Consider 1 — p> -, the reasonable region in 
practice. Solving 8LRp/da = w.r.t. a we get a minimum at a = a min := ^_^ n _ d - 
After some algebra we get 

LRp min = |logt/ T ?/ — |KL(^||1 — p), where KL(p\\q) = plog J + (1 -p)log^ 

(12) 

is the relative entropy or Kullback-Leibler divergence. Minimizing LRp™ 4 " w.r.t. M 
is equivalent to maximizing KL(-||1— p). This is an unusual task, since one mostly 
encounters D minimizations. For fixed d, LRp™ n is monotone increasing in p. Since 
Lossp oc p + a, LoRP suggests to minimize Loss for fixed model dimension d. For 
fixed p, LRp min is monotone increasing in d, i.e. LoRP suggests to minimize model 
dimension d for fixed Loss. Normally there is a tradeoff between minimizing d and 
p, and LoRP suggests that the optimal choice is the one that maximizes KL. 

Theorem 11 (LoRP for projective regression) The best projective regressor 
P:X n —>IR nxn with P = P 2 in some projective class V for data D = (x,y) is 

P best = argmax KL(^^||^^), provided ^ < ^p- 
p^-p v n 11 y y ' n y y 



4 Comparison to Gaussian Bayesian Linear Re- 
gression 

We now consider linear basis function regression (LBFR) from a Bayesian perspec- 
tive with Gaussian noise and prior, and compare it to LoRP. In addition to the noise 
model as in PML, one also has to specify a prior. Bayesian model selection (BMS) 
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proceeds by selecting the model that has largest evidence. In the special case of 
LBFR with Gaussian noise and prior and an ML-II estimate for the noise variance, 
the expression for the evidence has a similar structure as the expression of the loss 
rank. 

Gaussian Bayesian LBFR / MAP. Recall from SecJH] Exj9] that T<i is the class 
of functions f w (x) = w T <p(x) (wEM d ) that are linear in feature vector (f>. Let 

, , *. exp(-±(z - fiyz-^z - fj)) , 

Gauss/v(z III, E) := (13) 

V ' ; (27r)^/ 2 ydetS V ' 

denote a general iV-dimensional Gaussian distribution with mean /j, and covariance 
matrix S. We assume that observations y are perturbed from f w (x) by independent 
additive Gaussian noise with variance and zero mean, i.e. the likelihood of y 
under model w is P(y\w) = Gauss n (y|$iu, / 9 _1 1), where <3? ia = <j> a {xi). A Bayesian 
assumes a prior (before seeing y) distribution on w. We assume a centered Gaussian 
with covariance matrix (aC) _1 , i.e. P(w) = Gauss,i(iu|0,a _1 C _1 ). From the prior 
and the likelihood one can compute the evidence and the posterior 

Evidence: P(y) = J P(y\w)P(w)dw = Gauss n (y|0, /T 1 ^ 1 ) (14) 

Posterior: P(w\y) = P(y\w)P(w)/P(y) = G&ussd(w\w, A^ 1 ) 

B :=® T <$>, A:=aC + f3B, M := /3$A _1 # T , S:=l-M, (15) 

w := f3A^ 1 ^ Y y, y := <$>w = My 

A standard Bayesian point estimate for w for fixed d is the one that maximizes 
the posterior (MAP) (which in the Gaussian case coincides with the mean) w = 
&igmax w P(w\y) = (3A~ x ^y. For a — > 0, MAP reduces to Maximum Likelihood 
(ML), which in the Gaussian case coincides with the least squares regression of 
Exj9l For ct>0, the regression matrix M is not a projection anymore. 

Bayesian model selection. Consider now a family of models {J r i,J r 2,---}- Here 
the Td are the linear regressors with d basis functions, but in general they could 
be completely different model classes. All quantities in the previous paragraph 
implicitly depend on the choice of J 7 , which we now explicate with an index. In 
particular, the evidence for model class T is Pp{y). Bayesian Model Selection 
(BMS) chooses the model class (here d) T of highest evidence: 

jtBms = argmaxP^y) 

Once the model class JF BMS is determined, the MAP (or other) regression function 
fwj-sus or Mp-BMs are chosen. The data variance may be known or estimated 
from the data, C is often chosen 1, and a has to be chosen somehow. Note that 
while a — > leads to a reasonable MAP=ML regressor for fixed d, this limit cannot 
be used for BMS. 
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Comparison to LoRP. Inserting ([13]) into (TH|) and taking the logarithm we see 
that BMS minimizes 

- logP^j/) = f y T Sy - \ logdet 5 - § log £ (16) 

w.r.t. T. Let us estimate (3 by ML: We assume a broad prior a<^C/9 so that /3§§ = 
0(|) can be neglected. Then a ° g ^^ =|y T gy-ii+O(|n) = <S> (3^(3:=n/{y T Sy). 
Inserting /3 into ([TBI) we get 

- log PAV) = f log 2/ T ^2/ - I log det 5 - f log £ (17) 

Taking an improper prior P(/3) oc/3 _1 and integrating out /? leads for small a to a 
similar result. The last term in ( flTl) is a constant independent of T and can be 
ignored. The first two terms have the same structure as in linear LoRP ({TO]) , but 
the matrix S is different. In both cases, a act as regularizers, so we may minimize 
over a in BMS like in LoRP. For a = (which neither makes sense in BMS nor in 
LoRP), M in BMS coincides with M of Exj9j but still the Sq in LoRP is the square 
of the S in BMS. For a>0, M of BMS may be regarded as a regularized regressor 
as suggested in Secf2] (a), rather than a regularized loss function (b) used in LoRP. 
Note also that BMS is limited to (semi)parametric regression, i.e. does not cover the 
non-parametric kNN Exj2]and Kernel ExJHl unlike LoRP. 

Since B only depends on x (and not on y), and all P are implicitly conditioned 
on x, one could choose C = B. In this case, M = 7$5 _1 $ T , with 7 = < 1 
for a>0, is a simple multiplicative regularization of projection $5 _1 $ T , and ffT7|) 
coincides with (II II) for suitable a, apart from an irrelevant additive constant, hence 
minimizing ( TT7T) over a also leads to (Tl2l) . 

5 Comparison to other Model Selection Schemes 

In this section we give a brief introduction to Penalized Maximum Likelihood (PML) 
for (semi)parametric regression, and its major instantiations, the Akaike and the 
Bayesian Information Criterion (AIC and BIC), and the Minimum Description 
Length (MDL) principle, whose penalty terms are all proportional to the number of 
parameters d. The effective number of parameters is often much smaller than d, e.g. 
if there are soft constraints like in ridge regression. We compare MacKay's [Mac92j 
trace formula for Gaussian Bayesian LBFR and Hastie's et al. |HTF01] trace formula 
for general linear regression with LoRP. 

Penalized ML (AIC, BIC, MDL). Consider a d-dimensional stochastic model 
class like the Gaussian Bayesian linear regression example of SectionlH Let Pd(y\ w ) 
be the data likelihood under d- dimensional model w£M d . The maximum likelihood 
(ML) estimator for fixed d is 

w = aigm&xP d{y\w) = argmin{— logP^(y|it;)} 
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Since — logPd(y|u;) decreases with d, we cannot find the model dimension by sim- 
ply minimizing over d (overfitting) . Penalized ML adds a complexity term to get 
reasonable results 

d = argminj— \ogPd(y\w) + Penalty (d)} 

d 

The penalty introduces a tradeoff between the first and second term with a minimum 
at d<oo. Various penalties have been suggested: The Akaike Information Criterion 
(AIC) |Aka73j uses d, the Bayesian Information Criterion (BIC) |Sch78j and the 
(crude) Minimum Description Length (MDL) principle use |logn |Ris78l IGru04] for 
Penalty (d). There are at least three important conceptual differences to LoRP: 

• In order to apply PML one needs to specify not only a class of regression 
functions, but a full probabilistic model P^(y\w), 

• PML ignores or at least does not tell how to incorporate a potentially given 
loss-function, 

• PML (AIC, BIC, MDL) is mostly limited to (semi)parametric models (with d 
"true" parameters). 

We discuss two approaches to the last item in the remainder of this section: AIC, 
BIC, and MDL are not directly applicable (a) for non-parametric models like kNN 
or Kernel regression, or (b) if d does not reflect the "true" complexity of the model. 
For instance, ridge regression can work even for d larger than n, because a penalty 
pulls most parameters towards (but not exactly to) zero. MacKay [Mac92j suggests 
an expression for the effective number of parameters d e ff as a substitute for d in case 
of (b), and Hastie et al. |HTF01] more generally also for (a). 

The trace penalty for parametric Gaussian LBFR. We continue with the 
Gaussian Bayesian linear regression example (see Section H] for details and notation). 
Performing the integration in (TH|) . MacKay |Mac92| Eq.(21)] derives the following 
expression for the Bayesian evidence for C = 1 

-\ogP(y) = (aE w + (3E D ) + (±\ogdet A -llog a) log^ (18) 
E D = ±\\$w-y\\l, E w = l\\w\\ 2 2 

(the first bracket in (TIB"]) equals ^y T Sy and the second equals — TjlogdetS 1 , cf. ( 1161) ), 
Minimizing (fT8l) w.r.t. a leads to the following relation: 

= = E w + \txA^-£ (£\ogdetA = trA^) 

He argues that a||tu||2 corresponds to the effective number of parameters, hence 

rfMcK . = a ||^||2 = 2a E w = d-atrA- 1 (19) 

The trace penalty for general linear models. We now return to general linear 
regression y = M(x)y (T7j). LBFR is a special case of a projection matrix M = M 2 
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with rank d = trM being the number of basis functions. M leaves d directions 
untouched and projects all other n — d directions to zero. For general M, Hastie et 
al. [HTFOll Sec. 5. 4.1] argue to regard a direction that is only somewhat shrunken, 
say by a factor of < (3 < 1, as a fractional parameter (/3 degrees of freedom). If 
/3x,...,f3 n are the shrinkages = eigenvalues of M, the effective number of parameters 
could be defined as [H TFOlt Sec. 7. 6] 

n 

4J F : = = trM 

which generalizes the relation <i = trM beyond projections. For MacKay's M (1151) . 
trM = d—txA~ 1 , i.e. rf^T is consistent with and generalizes d}^S . 

Problems. Though nicely motivated, the trace formula is not without problems. 
First, since for projections, M = M 2 , one could equally well have argued for d^J = 
trM 2 . Second, for kNN we have trM = ^ (since M is | on the diagonal), which does 
not look unreasonable. Consider now kNN' where we average over the k nearest 
neighbors excluding the closest neighbor. For sufficiently smooth functions, kNN' 
for suitable k is still a reasonable regressor, but trM = (since M is zero on the 
diagonal). So d^J F = for kNN', which makes no sense and would lead one to always 
select the k = l model. 

Relation to LoRP. In the case of kNN', trM 2 would be a better estimate for 
the effective dimension. In linear LoRP, — logdetS'o, serves as complexity penalty. 
Ignoring the nullspace of S = (1—M) T (1—M) (jSJ), we can Taylor expand — |logdetSo 
in M 

oo 

-ilogdetSo = -trlog(U-M) = ±tr(M s ) = trM + |trM 2 + ... 

8=1 

For BMS (TTT1) with S=1 — M ( TT5|) we get half of this value. So the trace penalty 
may be regarded as a leading order approximation to LoRP. The higher order terms 
prevent peculiarities like in kNN'. 

6 Outlook 

So far we have only scratched at the surface of the Loss Rank Principle. LoRP 
seems to be a promising principle with a lot of potential, leading to a rich field. 
In the following we briefly summarize miscellaneous considerations, which may be 
elaborated on in the future: Experiments, Monte Carlo estimates for non-linear 
LoRP, numerical approximation of detS'a, LoRP for classification, self-consistent 
regression, explicit expressions for kNN on a grid, loss function selection, and others. 

Experiments. Preliminary experiments on selecting k in kNN regression confirm 
that LoRP selects a "good" k. (Even on artificial data we cannot determine whether 
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the "right" k is selected, since kNN is not a generative model). LoRP for LBFR 
seems to be consistent with rapid convergence. 

Monte Carlo estimates for non-linear LoRP. For non-linear regression we 
did not present an efficient algorithm for the loss rank/volume LR r (y\x). The 
high-dimensional volume |V r (L)| ([3]) may be computed by Monte Carlo algorithms. 
Normally V r (L) constitutes a small part of y n , and uniform sampling over y n is not 
feasible. Instead one should consider two competing regressors r and r' and compute 
| VTlV \/\V\ and | VTlV'|/| V'| by uniformly sampling from V and V respectively e.g. 
with a Metropolis-type algorithm. Taking the ratio we get |V'|/|V| and hence the 
loss rank difference LR r — LR r /, which is sufficient for LoRP. The usual tricks and 
problems with sampling apply here too. 

Numerical approximation of det^. Even for linear regression, a Monte Carlo 
algorithm may be faster than the naive 0(n 3 ) algorithm [BFG96]. Often M is a 
very sparse matrix (like in kNN) or can be well approximated by a sparse matrix 
(like for Kernel regression), which allows to approximate detSa, sometimes in linear 
time [Reu02j . 

LoRP for classification. A classification problem is or can be regarded as a 
regression problem in which y is finite. This implies that we need to compute 
(count) LR r for non-linear r somehow, e.g. approximately by Monte Carlo. 

Self-consistent regression. So far we have considered only "on-data" regression. 
LoRP only depends on the regressor r on data D and not on x$.{xi,...,x n }. One can 
construct canonical regressors for off-data x from regressors given only on-data in the 
following way: We add a virtual data point (x,y) to D, where x is the off-data point 
of interest. If we knew y we could estimate y = r(x\{(x,y)}UD), but we don't know 
y. But if we require consistency, namely that y = y, we get a canonical estimate for 
y. First, this bootstrap may ease the specification of the regression models, second, 
it is a canonical way for interpolation (LoRP can't distinguish between r that are 
identical on D), and third, many standard regressors (kNN, Kernel, LBFR) are 
self-consistent in the sense that they are canonical. 

Explicit expressions for kNN on a grid. In order to get more insight into 
LoRP, a case that allows an analytic solution is useful. For k nearest neighbors 
classification with x lying on a hypercube of the regular grid X = 2Z d one can derive 
explicit expressions for the loss rank as a function of k, n, and d. For n3>/c3>3 d , the 
penalty — ilogdetS 1 is proportional to trM with proportionality constant decreasing 
from about 3.2 for d = 1 to 1.5 for d— >oo. 

LoRP for hybrid model classes. LoRP is not restricted to model classes indexed 
by a single integral "complexity" parameter, but may be applied more generally to 
selecting among some (typically discrete) class of models/regressors. For instance, 
the class could contain kNN and polynomial regressors, and LoRP selects the com- 
plexity and type of regressor (non-parametric kNN versus parametric polynomials). 
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General additive loss. Linear LoRP y = M(x)y of Section [3] can easily be gener- 
alized from quadratic to p-norm Lossm(v\x) = \\y — y\\p (any p). For a = 0, y T Soy 
in (J9]) becomes \ \y— y\\^ and v p the volume of the unit d- dimensional p-norm "ball". 
Useful expressions for general additive Losstv = Y^i^iVi ~ Vi) can a l so De derived. 
Regularization may be performed by M^^M with optimization over 7< 1. 

Loss-function selection. In principle, the loss function should be part of the 
problem specification, since it characterizes the ultimate goal. In reality, though, 
having to specify the loss function can be a nuisance. We could interpret the regu- 
larized loss (pj]) as a class of loss functions parameterized by a, and argmin Q LR" as 
a loss function optimization or selection. This suggests to choose in general the loss 
function that has minimal loss rank. This leads to sensible results if the considered 
class of loss functions is not too large (e.g. all p-norm losses in the previous para- 
graph). So LoRP can be used not only for model selection, but also for loss function 
selection. 
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